This document summarises the data from https://www.microbiologyresearch.org/content/journal/mgen/10.1099/mgen.0.001081. This study was produced by Michael Hall from Zam’s group and describes the development of the DRPRG tool that allows for drug resistance prediction using genome graphs.

In doing so, Michael collated and carefully(!) curated an enriched WHO dataset, which is summarised here.

Summary:

A breakdown of R/S phenotypes for the samples can be seen in the plot below:

pheno_data %>%
  select(-run, -bioproject, -biosample) %>%
  summarise(across(everything(), ~ list(R = sum(. == "R"), S = sum(. == "S")))) %>%
  mutate(phenotype = c("R", "S")) %>% 
  pivot_longer(-phenotype, names_to = "antibiotic", values_to = "value") %>%
  as.data.frame() %>%
  mutate(value = as.numeric(as.character(value))) %>%
  mutate(antibiotic = case_when(antibiotic == "para-aminosalicylic_acid" ~ "PAS",
                                TRUE ~ antibiotic)) %>%
    ggplot(., aes(x = antibiotic, y = value)) +
    geom_bar(aes(fill = phenotype), position = "dodge",stat = "identity") +
    theme_bw() +
    theme(axis.text.x = element_text(angle = 90, vjust = 0, size = 8)) +
  geom_text(
    stat = "identity", 
    aes(label = value, group = phenotype),
    vjust = -1,
    position = position_dodge(width = 0.9),
    size = 2.5) +
  ylab("# samples") +
  labs(title = "Data used for DRPRG: WHO samples + extra samples",
       subtitle = paste0("n = ", n_samples, " samples"),
       caption = "There is at least one drug phenotype for each sample")

The WHO data (minus the “enriched” part) was used to train the WHO catalogue. The “enriched” dataset minus the WHO data (e.g. the data from other publications that wasn’t included in the WHO training data) is as follows:

Summary:

A breakdown of R/S phenotypes for the samples can be seen in the plot below:

not_WHO_data %>%
  select(-run, -bioproject, -biosample) %>%
  summarise(across(everything(), ~ list(R = sum(. == "R"), S = sum(. == "S")))) %>%
  mutate(phenotype = c("R", "S")) %>% 
  pivot_longer(-phenotype, names_to = "antibiotic", values_to = "value") %>%
  as.data.frame() %>%
  mutate(value = as.numeric(as.character(value))) %>%
  mutate(antibiotic = case_when(antibiotic == "para-aminosalicylic_acid" ~ "PAS",
                                TRUE ~ antibiotic)) %>%
    ggplot(., aes(x = antibiotic, y = value)) +
    geom_bar(aes(fill = phenotype), position = "dodge",stat = "identity") +
    theme_bw() +
    theme(axis.text.x = element_text(angle = 90, vjust = 0, size = 8)) +
  geom_text(
    stat = "identity", 
    aes(label = value, group = phenotype),
    vjust = -1,
    position = position_dodge(width = 0.9),
    size = 2.5) +
  ylab("# samples") +
  labs(title = "Data used for DRPRG: extra samples only",
       subtitle = paste0("n = ", n_samples_not_WHO, " samples"),
       caption = "There is at least one drug phenotype for each sample")

How many CRyPTIC samples in the WHO training dataset?

cryptic = fread("/Users/kmalone/macbook_m1_backup/github/cryptic-catalogue/data/CRyPTIC_reuse_table_20231107.csv")

# mismatch_WHO_cryptic_run = cryptic %>%
#   dplyr::rename("run" = ENA_RUN) %>%
#   anti_join(., who, by = "run")
# 
# mismatch_WHO_cryptic_run_dedup = mismatch_WHO_cryptic_run %>%
#   separate(run, into = c("run1", "run2"), sep = "\\.", extra = "merge") %>%
#   pivot_longer(cols = starts_with("run"), names_to = "run_type", values_to = "ENA_RUN") %>%
#   filter(ENA_RUN != "") %>%
#   select(-run_type)
# 
# mismatch_WHO_cryptic_run_dedup %>%
#   dplyr::rename("run" = ENA_RUN) %>%
#   anti_join(., who, by = "run")
# 
# mismatch_WHO_cryptic_ENA = cryptic %>%
#   dplyr::rename("sample" = ENA_SAMPLE) %>%
#   anti_join(., who, by = "sample")


mismatch_pheno_cryptic_run = cryptic %>%
    separate(ENA_RUN, into = c("run1", "run2"), sep = "\\.", extra = "merge") %>%
    pivot_longer(cols = starts_with("run"), names_to = "run_type", values_to = "run") %>%
    filter(run != "") %>%
    anti_join(., pheno_data, by = "run") %>%
  dplyr::rename("biosample" = ENA_SAMPLE)
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 12163 rows [1, 2, 3, 4,
## 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
mismatch_pheno_cryptic_ENA = cryptic %>%
  dplyr::rename("biosample" = ENA_SAMPLE) %>%
  anti_join(., pheno_data, by = "biosample") %>%
  dplyr::rename("run" = ENA_RUN)


mismatch_pheno_cryptic_ENA %>%
    separate(run, into = c("run1", "run2"), sep = "\\.", extra = "merge") %>%
    pivot_longer(cols = starts_with("run"), names_to = "run_type", values_to = "run") %>%
    filter(run != "") %>%
    anti_join(., pheno_data, by = "run") %>% 
  select(run, biosample)
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 3392 rows [1, 2, 3, 4, 5,
## 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
## # A tibble: 981 × 2
##    run        biosample 
##    <chr>      <chr>     
##  1 ERR4810795 ERS5298822
##  2 ERR4810696 ERS5298723
##  3 ERR4810731 ERS5298758
##  4 ERR4810730 ERS5298757
##  5 ERR4810729 ERS5298756
##  6 ERR4811033 ERS5300751
##  7 ERR4810668 ERS5298695
##  8 ERR4810689 ERS5298716
##  9 ERR4810721 ERS5298748
## 10 ERR4811028 ERS5300746
## # ℹ 971 more rows
#sites
mismatch_pheno_cryptic_ENA %>%
    separate(run, into = c("run1", "run2"), sep = "\\.", extra = "merge") %>%
    pivot_longer(cols = starts_with("run"), names_to = "run_type", values_to = "run") %>%
    filter(run != "") %>%
    anti_join(., pheno_data, by = "run") %>% 
  dplyr::mutate(x = str_split(UNIQUEID, "\\.subj") %>% map_chr(1)) %>% 
  with(table(.$x))
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 3392 rows [1, 2, 3, 4, 5,
## 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
## 
## site.02 site.03 site.04 site.05 site.06 site.08 site.10 site.11 site.14 site.17 
##      82      22       3       2     357      36       4     456      12       5 
## site.20 
##       2

How was this data created?

According to the README.md in the data source, the phenotype data can be found in config/illumina.samplesheet.csv.
It was synthesized by summarising the experiments found in config/samplesheets/ using workflow/notebook/notepad.ipynb. This notebook is large and includes various analyses but the one we are interested in is “Data Collation”.

The first steps involved creating a WHO base dataset with two different data sources.

  1. “gentb”, which is labelled as WHO correspondence. This is the data underlying Maha Farhat’s GenTB paper.

  2. “WHO”, which I’m assuming is from the latest catalogue.

  3. Lots of data cleaning, resolving discrepancies between the two datasets and extra metadata added using code from https://github.com/mbhall88/WHO-correspondence/blob/main/docs/fill_in_who_samplesheet.py.

Next, data was sequentially added and cleaned data from various publications:

dataset reference phenotype method
gentb https://genomemedicine.biomedcentral.com/articles/10.1186/s13073-021-00953-4 various
trisakil https://doi.org/10.1080/22221751.2022.2099304 various
smith https://pubmed.ncbi.nlm.nih.gov/33055186/ liquid MGIT 960 system (Bactec MGIT SIRE and PZA package inserts; Becton, Dickinson) and solid 7H10 agar proportion method
peker https://doi.org/10.1099/mgen.0.000695 incredibly not noted, only reference in methods is 'All MTB isolates were phenotypically tested for drug susceptibility (phenotypic DST)'
merker https://doi.org/10.1038/s41467-022-32455-1 various
finci https://pubmed.ncbi.nlm.nih.gov/35907429/ BACTEC MGIT 960 DST and Sensititre MYCOTB MIC plate (binary results reported)
leah_bdq https://doi.org/10.1101/2022.12.08.519610 BACTEC MGIT 960 DST
marco_pheno https://doi.org/10.3389/fmicb.2023.1104456 not stated, just that they were 'according to WHO classification'
lempens_acc https://doi.org/10.1016/j.ijid.2020.08.042 LJ slopes and 7H11 plates, proportional

Then, the following was noted: “Get the BioProject of all BioSamples with antibiogram data in NCBI. Once I have the BioProject, I can …..download the antibiogram table”.

This resulted in an extra 1073 samples being added to the superset with phenotypes for at least one drug.

bioproj reference phenotype method
PRJNA353873 MGIT, MICs listed
PRJNA413593 MGIT, MICs listed
PRJNA438921 MGIT, MICs listed
PRJNA557083 MGIT, MICs listed
PRJNA650381 MGIT and proportional agar, MICs listed for both
PRJNA663350 MGIT, MICs listed
PRJNA717333 96 well plate, MICs listed
PRJNA824124 MGIT, MICs listed
PRJNA834625 LJ slopes, MICs listed
PRJNA888434 MGIT, MICs listed

All data together:

dataset reference phenotype method
gentb https://genomemedicine.biomedcentral.com/articles/10.1186/s13073-021-00953-4 various
trisakil https://doi.org/10.1080/22221751.2022.2099304 various
smith https://pubmed.ncbi.nlm.nih.gov/33055186/ liquid MGIT 960 system (Bactec MGIT SIRE and PZA package inserts; Becton, Dickinson) and solid 7H10 agar proportion method
peker https://doi.org/10.1099/mgen.0.000695 incredibly not noted, only reference in methods is 'All MTB isolates were phenotypically tested for drug susceptibility (phenotypic DST)'
merker https://doi.org/10.1038/s41467-022-32455-1 various
finci https://pubmed.ncbi.nlm.nih.gov/35907429/ BACTEC MGIT 960 DST and Sensititre MYCOTB MIC plate (binary results reported)
leah_bdq https://doi.org/10.1101/2022.12.08.519610 BACTEC MGIT 960 DST
marco_pheno https://doi.org/10.3389/fmicb.2023.1104456 not stated, just that they were 'according to WHO classification'
lempens_acc https://doi.org/10.1016/j.ijid.2020.08.042 LJ slopes and 7H11 plates, proportional
PRJNA353873 MGIT, MICs listed
PRJNA413593 MGIT, MICs listed
PRJNA438921 MGIT, MICs listed
PRJNA557083 MGIT, MICs listed
PRJNA650381 MGIT and proportional agar, MICs listed for both
PRJNA663350 MGIT, MICs listed
PRJNA717333 96 well plate, MICs listed
PRJNA824124 MGIT, MICs listed
PRJNA834625 LJ slopes, MICs listed
PRJNA888434 MGIT, MICs listed